October 2020
Why do we want to plot data?
Looking at the data as a first step of analysis is always a good idea
A striking example of this is the “Datasaurus dozen”: a dull and unimpressive dataset.
Two variables, x and y, over 13 different conditions
Load the data (data/DatasaurusDozen.tsv) and compute the mean of x and y by dataset
df <- read_tsv("data/DatasaurusDozen.tsv")
df %>%
group_by(dataset) %>%
summarise(mean_x = round(mean(x),2), mean_y = round(mean(y),2)) %>%
kable()
| dataset | mean_x | mean_y |
|---|---|---|
| away | 54.27 | 47.83 |
| bullseye | 54.27 | 47.83 |
| circle | 54.27 | 47.84 |
| dino | 54.26 | 47.83 |
| dots | 54.26 | 47.84 |
| h_lines | 54.26 | 47.83 |
| high_lines | 54.27 | 47.84 |
| slant_down | 54.27 | 47.84 |
| slant_up | 54.27 | 47.83 |
| star | 54.27 | 47.84 |
| v_lines | 54.27 | 47.84 |
| wide_lines | 54.27 | 47.83 |
| x_shape | 54.26 | 47.84 |
But if you plot it, you’ll see stark differences
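A minimal sketch of such a plot, assuming the data frame has the columns `dataset`, `x` and `y` used in the summary above. The helper name `plot_by_dataset` and the toy data frame are illustrative and not part of the original material.

```r
library(ggplot2)

# Hypothetical helper: one scatter-plot panel per dataset.
plot_by_dataset <- function(df) {
  ggplot(df, aes(x = x, y = y)) +
    geom_point(size = 0.5) +
    facet_wrap(~ dataset)
}

# Toy stand-in with the same column names as the Datasaurus file;
# with the real data, call plot_by_dataset(df) instead.
toy <- data.frame(
  dataset = rep(c("a", "b"), each = 3),
  x = c(1, 2, 3, 1, 2, 3),
  y = c(1, 2, 3, 3, 2, 1)
)
p_toy <- plot_by_dataset(toy)
```

Printing `p_toy` (or the plot built from the real `df`) draws one panel per value of `dataset`, which is what makes the differences visible at a glance.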
Good plots convey a lot of information in a compact way
…and are honest to the data
So let’s start with a gallery of bad plots. Can you guess why they are bad?
Examples:
We will start by using the built-in dataset mpg
mpg
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…     1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…     1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…     2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows
skimr::skim(mpg)
| Name | mpg |
| Number of rows | 234 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| manufacturer | 0 | 1 | 4 | 10 | 0 | 15 | 0 |
| model | 0 | 1 | 2 | 22 | 0 | 38 | 0 |
| trans | 0 | 1 | 8 | 10 | 0 | 10 | 0 |
| drv | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
| fl | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
| class | 0 | 1 | 3 | 10 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| displ | 0 | 1 | 3.47 | 1.29 | 1.6 | 2.4 | 3.3 | 4.6 | 7 | ▇▆▆▃▁ |
| year | 0 | 1 | 2003.50 | 4.51 | 1999.0 | 1999.0 | 2003.5 | 2008.0 | 2008 | ▇▁▁▁▇ |
| cyl | 0 | 1 | 5.89 | 1.61 | 4.0 | 4.0 | 6.0 | 8.0 | 8 | ▇▁▇▁▇ |
| cty | 0 | 1 | 16.86 | 4.26 | 9.0 | 14.0 | 17.0 | 19.0 | 35 | ▆▇▃▁▁ |
| hwy | 0 | 1 | 23.44 | 5.95 | 12.0 | 18.0 | 24.0 | 27.0 | 44 | ▅▅▇▁▁ |
ggplot2. Why?
Advantages of ggplot2
grammar of graphics (Wilkinson, 2005)
The basic idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:
statistical transformations
faceting
Just as the minimal sentence in a grammar is a subject, the minimal object in a plot is the data
ggplot(mpg)
In a grammar, you also need a verb. In plots, this is the axes
p <- ggplot(mpg, aes(x = displ, y = hwy))
p
But you also need an object. In ggplot, this is the geoms
p + geom_point()
p + geom_smooth()
You can add (+) as many geoms as you wish
p + geom_smooth() + geom_point()
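Each geom added with `+` uses the x and y given in `ggplot()` unless it is told otherwise. The sketch below illustrates how this plays out when an extra aesthetic is added; the object names `p_local` and `p_global` and the choice of `drv` as the colour variable are illustrative assumptions.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy))

# Colour mapped inside geom_point(): only the points are coloured,
# while geom_smooth() still fits one curve through all the data.
p_local <- p + geom_smooth() + geom_point(aes(color = drv))

# Colour mapped in ggplot(): both geoms inherit it, so geom_smooth()
# fits one curve per value of drv.
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_smooth() +
  geom_point()
```

Where an aesthetic is declared therefore decides which layers it affects.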
Map variables to aesthetics to add color, fill, size, shape, etc…
p + geom_point(aes(color = class))
p + geom_point(aes(size = cyl))
p + geom_point(aes(size = cyl, color = class))
p + geom_point(aes(shape = fl))
p + geom_point(aes(color = manufacturer, shape = fl, size = cyl))
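An aesthetic can either be mapped to a variable inside `aes()` or set to a fixed value outside it. A short sketch of the difference follows; the object names and the colour `"steelblue"` are illustrative choices.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy))

# Mapping: colour varies with the data, and a legend is drawn.
p_mapped <- p + geom_point(aes(color = class))

# Setting: every point gets the same fixed colour, and no legend is drawn.
p_fixed <- p + geom_point(color = "steelblue", size = 2)

# Common slip: a constant inside aes() is treated as data, so the points
# are not drawn in blue; they get a default colour and a legend entry "blue".
p_slip <- p + geom_point(aes(color = "blue"))
```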
You can also split the plot into facets
p + geom_point(aes(color = manufacturer, size = cyl)) + facet_grid(. ~ fl)
A ggplot is made up of
- a call to ggplot(): ggplot(df, ...)
- aesthetics (x, y, color, fill, shape, size, …): ggplot(df, aes(dimension = variable))
- geoms (geom_*): + geom_line()
- geoms inherit the aes of the plot if not specified
- aes vary with the data
And then you can change how things look and behave:
- coordinate functions (changing the axis appearance and type)
- scale functions (changing the appearance of the geoms)
- theme functions (changing the appearance of the plot itself)
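As a rough illustration of those three families of functions, the sketch below adds one of each to a scatter plot. The palette, axis limits and labels are arbitrary choices made for the example.

```r
library(ggplot2)

p_styled <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  # scale: choose the colours and the legend title for the drv mapping
  scale_color_brewer(palette = "Dark2", name = "Drive") +
  # coordinates: zoom in on part of the x axis without dropping data
  coord_cartesian(xlim = c(1.5, 5)) +
  # labels and theme: non-data appearance of the plot
  labs(x = "Engine displacement (l)", y = "Highway miles per gallon") +
  theme_minimal()
```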
Plot types depend on the variable type
p <- ggplot(mpg, aes(drv))
p + geom_bar()
# color only changes the outline of the bars
p + geom_bar(aes(color = drv))
# fill changes the inside of the bars
p + geom_bar(aes(fill = drv))
# filling by another variable stacks the bars by class
p + geom_bar(aes(fill = class))
# dodge: one bar per class, side by side
p + geom_bar(aes(fill = class), position = position_dodge())
# fill: stacked bars rescaled to the same height, showing proportions
p + geom_bar(aes(fill = class), position = position_fill())
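The three position adjustments can be compared side by side. In this sketch the `preserve = "single"` argument and the use of `layer_data()` to inspect the computed bar heights are additions for illustration, not part of the original slides.

```r
library(ggplot2)

p <- ggplot(mpg, aes(drv))

# Default: position = "stack", bars are stacked counts.
p_stack <- p + geom_bar(aes(fill = class))

# Dodge: bars side by side; preserve = "single" keeps all bars the same
# width even when some class/drv combinations are absent.
p_dodge <- p + geom_bar(aes(fill = class),
                        position = position_dodge(preserve = "single"))

# Fill: each stack is rescaled to height 1, so the y axis is a proportion.
p_fill <- p + geom_bar(aes(fill = class), position = position_fill()) +
  ylab("proportion")

# layer_data() returns the values ggplot2 computed for a layer;
# the top of every filled stack sits at 1.
top_of_stacks <- max(layer_data(p_fill)$ymax)
```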
p <- ggplot(mpg, aes(hwy))
p + geom_histogram()
p + geom_histogram(bins = 10)
p + geom_histogram(bins = 100)
p + geom_dotplot(binwidth = 0.5)
p + geom_density()
p + geom_density(adjust = 3)
p + geom_density(adjust = 0.5)
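Bin count and density smoothing play the same role: both control how much detail is shown. A hedged sketch of two further options follows, assuming ggplot2 3.3.0 or later for `after_stat()`; the bin width of 2 is an arbitrary choice.

```r
library(ggplot2)

p <- ggplot(mpg, aes(hwy))

# Fix the number of bins ...
p_bins <- p + geom_histogram(bins = 10)

# ... or fix the width of each bin in data units (here 2 miles per gallon).
p_width <- p + geom_histogram(binwidth = 2)

# Overlay histogram and density on a common scale by drawing the
# histogram as a density instead of counts.
p_both <- p +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, alpha = 0.5) +
  geom_density()
```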
Plot types depend on the variable type
if both variables are continuous, the natural choice is a scatter plot
p <- ggplot(mpg, aes(x = cty, y = hwy))
p + geom_point()
still, you might just want to show the general tendency
p + geom_smooth()
or both
p + geom_smooth() + geom_point()
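Two refinements are often useful with this kind of scatter plot. The sketch below is illustrative only: the jitter amounts and transparency are arbitrary, and the straight-line fit is just one possible choice of smoother.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy))

# cty and hwy are integers, so many points overlap exactly;
# a little jitter and transparency make the density of points visible.
p_jitter <- p + geom_jitter(width = 0.3, height = 0.3, alpha = 0.4)

# geom_smooth() picks a smoother by default; method = "lm" asks for a
# straight-line fit instead, and se = FALSE hides the confidence band.
p_lm <- p + geom_point(alpha = 0.4) + geom_smooth(method = "lm", se = FALSE)
```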
one variable discrete, the other continuous (note: it needs a summarise())
mpg %>%
  group_by(manufacturer) %>%
  summarise(n = n()) %>%
  ggplot(aes(manufacturer, n)) +
  geom_col()
the above could have been easily done with geom_bar (that counts for us)
mpg %>% ggplot(aes(manufacturer))+ geom_bar()
but columns give you more options, since now you condition on a proper variable (n). For instance: order by n
mpg %>%
  group_by(manufacturer) %>%
  summarise(n = n()) %>%
  ggplot(aes(reorder(manufacturer, -n), n)) +
  geom_col()
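To see what `reorder()` does, it can help to apply it to the summarised table before plotting. This is a sketch under the assumption that dplyr is loaded; `count()` is used as a shorthand for the `group_by()` plus `summarise()` step, and `coord_flip()` is an optional extra.

```r
library(ggplot2)
library(dplyr)

# count() gives the same table as group_by() + summarise(n = n()).
counts <- mpg %>%
  count(manufacturer)

# reorder() turns manufacturer into a factor whose levels are sorted
# by -n, i.e. the most frequent manufacturer comes first.
counts <- counts %>%
  mutate(manufacturer = reorder(manufacturer, -n))

p_ordered <- ggplot(counts, aes(manufacturer, n)) +
  geom_col() +
  coord_flip()  # horizontal bars keep long names readable
```

The axis order follows the factor levels, so `levels(counts$manufacturer)` shows the order in which the bars will appear.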
boxplots show a distribution, and can do so over the different levels of a categorical variable
mpg %>% ggplot(aes(drv, hwy))+ geom_boxplot()
boxplots are bulky and only show summary statistics (median, quartiles, outliers). Want the full distribution? Use violins
mpg %>% ggplot(aes(drv, hwy))+ geom_violin()
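Because geoms can be layered, the two views can also be combined. The following is a sketch rather than a recommendation; the box width and transparency are arbitrary.

```r
library(ggplot2)

p_combo <- ggplot(mpg, aes(drv, hwy)) +
  geom_violin(aes(fill = drv), alpha = 0.4) +  # full shape of the distribution
  geom_boxplot(width = 0.1)                    # median and quartiles on top
```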
remember: all is modular. We can always add color, fill…
mpg %>%
  ggplot(aes(drv, hwy, color = drv, fill = drv)) +
  geom_violin()
remember: all is modular. …facets
mpg %>%
  ggplot(aes(drv, hwy, color = drv, fill = drv)) +
  geom_violin() +
  facet_grid(. ~ year)
if both variables are categorical, you can count their cross-tabulation
mpg %>% ggplot(aes(fl, drv))+ geom_count()
Plot types depend on the variable type
two variables define the x,y grid. A third defines the color of the cell. Average city mileage (cty) by year and drive (note: usually requires summarise())
mpg %>%
  group_by(year, drv) %>%
  summarise(cty = mean(cty)) %>%
  ggplot(aes(x = drv, y = year, fill = cty)) +
  geom_tile()
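A tile plot is often easier to read with the value printed in each cell and with the grid variables treated as discrete. A possible sketch follows; the column name `mean_cty`, the white label colour and the use of `factor(year)` are choices made for this example.

```r
library(ggplot2)
library(dplyr)

tiles <- mpg %>%
  group_by(year, drv) %>%
  summarise(mean_cty = mean(cty)) %>%  # one value per cell of the grid
  ungroup()

p_tile <- ggplot(tiles, aes(x = drv, y = factor(year), fill = mean_cty)) +
  geom_tile() +
  geom_text(aes(label = round(mean_cty, 1)), color = "white") +
  labs(y = "year", fill = "mean cty")
```

Treating `year` as a factor gives one row of tiles per model year instead of a continuous axis running from 1999 to 2008.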